# 02. Review: MC Control Methods

In the previous lesson, you learned about the control problem in reinforcement learning and implemented some Monte Carlo (MC) control methods.

Control Problem: Estimate the optimal policy.

In this lesson, you will learn several techniques for Temporal-Difference (TD) control.

## Review

Before continuing, please review Constant-alpha MC Control from the previous lesson.

Remember that the constant-\alpha MC control algorithm alternates between policy evaluation and policy improvement steps to recover the optimal policy \pi_*.

(Figure: Constant-alpha MC Control)
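
To make the alternation concrete, the sketch below shows what the outer control loop might look like in Python. It is only an illustrative sketch: the environment is assumed to follow a Gymnasium-style interface (`env.reset()` returning `(state, info)`, `env.step(action)` returning a 5-tuple, and a discrete `env.action_space`), and the helper `update_Q`, which carries out the policy evaluation update, is sketched after the update equation further below.

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for an epsilon-greedy policy at a single state."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def mc_control(env, num_episodes, alpha, gamma=1.0):
    """Constant-alpha MC control: alternate policy evaluation and improvement."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    for i_episode in range(1, num_episodes + 1):
        epsilon = 1.0 / i_episode              # gradually favor exploitation
        # --- policy evaluation: collect an episode with the current policy ---
        episode = []
        state, _ = env.reset()
        done = False
        while not done:
            probs = epsilon_greedy_probs(Q[state], epsilon)
            action = np.random.choice(n_actions, p=probs)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            done = terminated or truncated
        # --- update the Q-table from the collected episode (see update_Q below) ---
        Q = update_Q(episode, Q, alpha, gamma)
    return Q
```

Note that policy improvement is implicit here: because each episode is generated \epsilon-greedily with respect to the current Q-table, improving Q automatically improves the behavior policy.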

In the policy evaluation step, the agent collects an episode S_0, A_0, R_1, \ldots, S_T using the most recent policy \pi. After the episode finishes, for each time-step t, if the corresponding state-action pair (S_t,A_t) is a first visit, the Q-table is modified using the following update equation:

Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha(G_t - Q(S_t, A_t))

where G_t := \sum_{s=t+1}^{T}\gamma^{s-t-1}R_s is the return at time-step t, and Q(S_t,A_t) is the entry in the Q-table corresponding to state S_t and action A_t.
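
As an illustration, this update can be translated almost line for line into code. The sketch below (the function name `update_Q` and the episode format are our own conventions, not from any particular library) walks backward through an episode, accumulates the discounted return G_t, and applies the constant-\alpha update at every first visit:

```python
import numpy as np
from collections import defaultdict

def update_Q(episode, Q, alpha, gamma=1.0):
    """Apply the constant-alpha first-visit MC update for one episode.

    `episode` is a list of (state, action, reward) tuples, where the reward
    is the one received after taking the action in the state.
    """
    states, actions, rewards = zip(*episode)
    # Record the index of the first visit of each state-action pair.
    first_visit = {}
    for t, (s, a) in enumerate(zip(states, actions)):
        first_visit.setdefault((s, a), t)
    # Walk backward so that G accumulates the discounted future rewards.
    G = 0.0
    for t in reversed(range(len(episode))):
        G = rewards[t] + gamma * G            # G_t = R_{t+1} + gamma * G_{t+1}
        if first_visit[(states[t], actions[t])] == t:
            # Q(S_t, A_t) <- Q(S_t, A_t) + alpha * (G_t - Q(S_t, A_t))
            Q[states[t]][actions[t]] += alpha * (G - Q[states[t]][actions[t]])
    return Q

# Tiny usage example: a two-step episode in a toy problem with 2 actions.
Q = defaultdict(lambda: np.zeros(2))
episode = [("s0", 0, 0.0), ("s1", 1, 1.0)]
Q = update_Q(episode, Q, alpha=0.1, gamma=1.0)
print(Q["s0"][0], Q["s1"][1])   # both move from 0.0 to 0.1 with gamma = 1
```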

The main idea behind this update equation is that Q(S_t,A_t) contains the agent's current estimate of the expected return when the environment is in state S_t and the agent selects action A_t. If the observed return G_t differs from Q(S_t,A_t), we nudge Q(S_t,A_t) toward G_t so that it agrees slightly more with the return. The size of that nudge is controlled by the step-size hyperparameter \alpha>0: the larger \alpha is, the more weight each new return carries.
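
For a concrete sense of the role of \alpha, suppose Q(S_t,A_t) is currently 3.0 and the observed return is G_t = 6.0 (made-up numbers). The snippet below shows how different step sizes move the estimate:

```python
q, G = 3.0, 6.0                       # current estimate and observed return
for alpha in (0.05, 0.1, 0.5, 1.0):
    new_q = q + alpha * (G - q)       # constant-alpha update
    print(f"alpha={alpha:<4} -> Q moves from {q} to {new_q}")
# alpha = 1.0 discards the old estimate entirely (the new Q equals G),
# while a small alpha makes only a slight correction toward the return.
```

Setting \alpha = 1 would mean each new return completely overwrites the previous estimate, which is why \alpha is kept small in practice so that each noisy return only slightly adjusts the estimate.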